A Comparison of 1-D and 2-D Data Mapping for Sparse LU Factorization with Partial Pivoting
Abstract
This paper presents a comparative study of two data mapping schemes for parallel sparse LU factorization with partial pivoting on distributed memory machines. Our previous work has developed an approach that incorporates static symbolic factorization, nonsymmetric L/U supernode partitioning, and graph scheduling for this problem with 1-D column-block mapping. The 2-D mapping is commonly considered more scalable for general matrix computation, but it is difficult to incorporate efficiently into sparse LU because partial pivoting and row interchanges require frequent, synchronized inter-processor communication. We have developed an asynchronous sparse LU algorithm using 2-D block mapping and obtained competitive performance on the Cray T3D. We report our preliminary studies on the speedups, scalability, communication cost, and memory requirements of this algorithm and compare it with the 1-D approach.

Supported by NSF RIA CCR and CDA, and by a startup fund from the University of California at Santa Barbara. Department of Computer Science, University of California, Santa Barbara, CA.

1 Introduction

Efficient parallelization of sparse LU factorization with pivoting is important to many scientific applications. Unlike sparse Cholesky factorization, for which the parallelization problem has been relatively well solved, sparse LU factorization is much harder to parallelize because of its dynamic nature caused by pivoting operations. Previous work has addressed parallelization issues using shared memory platforms or restricted pivoting. In our earlier work we proposed a novel approach that integrates three key strategies in parallelizing this algorithm on distributed memory machines: (1) adopt a static symbolic factorization scheme to eliminate the data structure variation caused by dynamic pivoting; (2) identify data regularity in the sparse structure obtained by the symbolic factorization so that efficient dense operations can be used to perform most of the computation; (3) make use of graph scheduling techniques and an efficient run-time support called RAPID to exploit irregular parallelism. The preliminary experiments are encouraging, and good performance results have been obtained with the 1-D data mapping.

In the literature, the 2-D mapping has been shown to be more scalable than the 1-D mapping for dense LU factorization and sparse Cholesky factorization, and a sparse solver with element-wise 2-D mapping has been presented. For better cache performance, block partitioning is preferred. However, several difficulties arise in applying a 2-D block-oriented mapping to sparse LU factorization, even when the static structure is predicted in advance. First of all, pivoting operations and row interchanges require frequent and well-synchronized inter-processor communication when submatrices in the same column block are assigned to different processors. In order to exploit irregular parallelism, we need to deal with irregular and asynchronous communication, which requires delicate message buffer management. Secondly, it is difficult to model the irregular parallelism of sparse LU: using the elimination tree of A^T A is possible but not accurate. Lastly, space complexity is another issue, since exploiting irregular parallelism to the maximum degree may need more buffer space.

This paper presents our preliminary work on the design of an asynchronous algorithm for sparse LU with 2-D mapping. Section 2 briefly reviews the static symbolic factorization, sparse matrix partitioning, and the 1-D algorithm. Section 3 presents the 2-D algorithm. Section 4 discusses the advantages and disadvantages of the 1-D and 2-D codes. Section 5 presents the experimental results. Section 6 concludes the paper.

2 Background and 1-D approaches

The purpose of sparse LU factorization is to find two matrices L and U for a given nonsymmetric sparse matrix A such that PA = LU, where L is a unit lower triangular matrix, U is an upper triangular matrix, and P is a permutation matrix containing the pivoting information. Sparse LU is very hard to parallelize because pivot selection and row interchanges dynamically introduce fill-ins and change the L/U data structures. Using the precise pivoting information at each elimination step can certainly optimize data space usage and improve load balance, but it leads to high run-time overhead. In this section we briefly discuss some of the techniques used in our S* parallel sparse LU algorithm.

2.1 Static symbolic factorization

Static symbolic factorization has been proposed to identify the worst-case nonzero patterns. The basic idea is to statically consider all the possible pivoting choices at each elimination step, and to allocate space for every nonzero that could be introduced by any pivoting sequence arising during the numerical factorization. The static approach avoids data structure expansion during the numerical factorization. Dynamic factorization, which is used in the efficient sequential code SuperLU, provides more accurate data structure prediction on the fly, but parallelizing SuperLU on distributed memory machines is challenging; currently the SuperLU group has been working on shared memory parallelizations.
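As a concrete illustration of this scheme, the following minimal Python sketch computes the worst-case pattern on a dense boolean matrix. The function name and the dense representation are ours for exposition (a real implementation would operate on compressed sparse structures), so treat it as a sketch of the idea rather than the actual S* code.

import numpy as np

def static_symbolic_pattern(A_pattern):
    # S[i, j] == True means entry (i, j) may become nonzero in L or U
    # under some partial pivoting sequence.
    S = np.array(A_pattern, dtype=bool)
    n = S.shape[0]
    for k in range(n):
        # Every row with a nonzero in column k is a pivot candidate at step k.
        cand = [i for i in range(k, n) if S[i, k]]
        if not cand:
            continue
        # Whichever candidate becomes the pivot row updates all the others,
        # so every candidate row may receive the union of their structures.
        union = S[cand, k:].any(axis=0)
        S[cand, k:] |= union
    return S  # the lower part holds the L pattern, the upper part the U pattern

# Tiny usage example: 1 marks a structural nonzero of A.
A = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]
S = static_symbolic_pattern(A)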
2.2 L/U supernode partitioning

After the nonzero fill-in pattern of a matrix has been predicted, the matrix is further partitioned using a supernode approach to improve cache performance. A nonsymmetric supernode is defined as a group of consecutive columns in which the corresponding L factor has a dense lower triangular block on the diagonal and the same nonzero pattern below the diagonal. Based on this definition, in each column block the L part contains only dense subrows; we call this partitioning method L supernode partitioning. Here, by subrow we mean the contiguous part of a row within a supernode.

For SuperLU it is difficult to exploit structural regularity in the U factor after L supernode partitioning. In our approach, however, the nonzeros in the U factor, including overestimated fill-ins, can be clustered into dense columns or subcolumns. The U partitioning strategy is as follows. After an L supernode partition has been obtained for a sparse matrix A, i.e., a set of column blocks with possibly different block sizes, the same partitioning is applied to the rows of the matrix to further break each supernode into submatrices. Each off-diagonal submatrix in the L part is then either a dense block or contains dense blocks. Furthermore, we have shown in our earlier work that each nonzero submatrix in the U factor of A contains only dense subcolumns. This is the key to maximizing the use of BLAS subroutines in our algorithm. Figure 1(a) illustrates the dense patterns in a partitioned sparse matrix.
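To make the definition concrete, here is a small Python sketch that derives an L supernode partition from a filled pattern such as the one computed above. The merge test (a dense diagonal block plus identical structure below it) follows the definition given in the text, while the function name, the dense boolean input, and the adjacent-column comparison are our own simplifications; production codes also typically cap the supernode size for cache blocking, which is omitted here.

def l_supernode_partition(S):
    # S is the boolean filled L/U pattern from static symbolic factorization.
    # Returns half-open column ranges [start, end); within each range the
    # diagonal block of L is dense and all columns share the same nonzero
    # pattern below that block.
    n = S.shape[0]
    blocks, start = [], 0
    for j in range(1, n):
        same_below = bool((S[j + 1:, j] == S[j + 1:, j - 1]).all())
        dense_row = bool(S[j, start:j + 1].all())  # row j full inside the block
        if not (same_below and dense_row):
            blocks.append((start, j))
            start = j
    blocks.append((start, n))
    return blocks

Applying the same ranges to the rows, as described above, cuts every supernode into the submatrix blocks that serve as the units of data mapping below.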
2.3 Data mapping

For block-oriented matrix computation, 1-D column-block cyclic mapping and 2-D block cyclic mapping are commonly used. In 1-D column-block cyclic mapping, as illustrated in Figure 1(b), a column block A_j is assigned to processor P_(j mod p), where p is the number of processors; each column block is called a panel. A 2-D block cyclic mapping views the processors as a 2-D r x s grid, and a block is the minimum unit of data mapping: a nonzero submatrix block A_(i,j) is assigned to processor P_(i mod r, j mod s), as illustrated in Figure 1(c). The 2-D mapping is usually considered more scalable than the 1-D mapping for Cholesky factorization because it tends to have better computational load balance and lower communication volume. However, the 2-D mapping introduces more overhead for pivoting and row swapping.
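The two owner computations can be stated directly. In the short Python sketch below, blocks are indexed from zero and the processors of the r x s grid are numbered in row-major order; that linearization is our assumption, since the text only fixes the grid position (i mod r, j mod s).

def owner_1d(j, p):
    # 1-D column-block cyclic: column block j is owned by processor j mod p.
    return j % p

def owner_2d(i, j, r, s):
    # 2-D block cyclic: block (i, j) sits at grid position (i mod r, j mod s);
    # row-major numbering (our assumption) turns that into a single rank.
    return (i % r) * s + (j % s)

For example, with p = 4 processors, column blocks 0..7 are owned by processors 0, 1, 2, 3, 0, 1, 2, 3; with a 2 x 2 grid, block (3, 5) is owned by rank (3 mod 2) * 2 + (5 mod 2) = 3. The overhead noted above follows immediately: under the 2-D mapping the blocks of one column block are spread over r grid rows, so pivot search and row interchange within a column block become inter-processor operations.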